Measuring Spatial Ability for Talent Identification, Educational Assessment, and Support: Evidence from Adolescents with High Achievement in Science, Arts, and Sports

Background: Spatial ability (SA) is a robust predictor of academic and occupational achievement. The present study investigated the psychometric properties of 10 tests for measuring SA in a sample of talented schoolchildren. Objective: Our purpose was to identify the most suitable measures of SA for talent identification, educational assessment, and support. Design: Our sample consisted of 1479 schoolchildren who had demonstrated high achievement in Science, Arts, or Sports. Several criteria were applied to evaluate the measurements, including an absence of floor and ceiling effects, low redundancy, high reliability, and external validity. Results: Based on these criteria, we included the following four tests in an Online Short Spatial Ability Battery (OSSAB): Pattern Assembly; Mechanical Reasoning; Paper Folding; and Shape Rotation. Further analysis found differences in spatial ability across the three groups of gifted adolescents. The Science track showed the highest results in all four tests. Conclusion: Overall, the study suggested that the OSSAB can be used for talent identification, educational assessment, and support. The analysis showed a unifactorial structure of spatial abilities. Future research is needed to evaluate the use of this battery with other specific samples and unselected populations.

For example, individuals from Project Talent (Flanagan et al., 1962) with more pronounced spatial ability (compared to verbal ability) were more involved in math and science courses in high school (Wai et al., 2009). They were also more likely to choose the STEM fields for future education, while those with the opposite pattern (verbal ability advantage over spatial) were more likely to choose educational programs and careers focused on education, humanities, and social sciences.
Moreover, it appears that the likelihood of obtaining an advanced degree in STEM (from a BSc to a PhD) increases as a function of spatial ability: 45% of all those holding STEM PhDs scored within the top 4% on spatial ability 11 years earlier, and nearly 90% of all those holding STEM PhDs were in the top 23% or above. Similarly, about 30% of those holding STEM terminal master's degrees, and 25% of those holding STEM terminal bachelor's degrees, also scored in the top 4% on spatial ability (Wai et al., 2009).
Another study (Kell, Lubinski, Benbow, & Steiger, 2013) examined the spatial ability data for 563 participants from the Study of Mathematically Precocious Youth (SMPY; Shea et al., 2001). Levels of spatial ability, measured at age 13-14, added explanatory power 35 years later, accounting for 7.6% of the variance in creative achievement (number of patents and published articles), in addition to the 10.8% of variance explained by scores on the mathematics and verbal sections of the Scholastic Assessment Test (SAT). Lubinski and colleagues emphasized the necessity of adding a spatial assessment to talent search programs. This might help children and adolescents with high levels of spatial ability to reach their full potential. Without formal identification, spatially gifted adolescents may lack opportunities to develop their skills (Lohman, 1994; Lubinski, 2016), and even disengage from education (Lakin & Wai, 2020).
Despite being a robust predictor of future STEM achievement, spatial ability assessment is often not included in talent searches. This is because time for such assessments is generally limited and focused mostly on the numerical and verbal domains (Lakin & Wai, 2020). Few studies have examined the role of spatial ability in high achievement in nonacademic domains, such as sports and the arts. The results of existing studies are inconsistent, with some finding such links (Blazhenkova & Kozhevnikov, 2010; Hetland, 2000; Ivantchev & Petrova, 2016; Jansen, Ellinger, & Lehmann, 2018; Notarnicola et al., 2014; Ozel, Larue, & Molinaro, 2002, 2004; Stoyanova, Strong & Mast, 2018), and others failing to do so (Chan, 2007; Heppe, Kohler, Fleddermann, & Zentgraf, 2016; Sala & Gobet, 2017). One way to improve understanding of the role of SA in high achievement is to use the same test battery in samples selected for high achievement in different domains. To our knowledge, our study is the first to carry out such an investigation.
Irrespective of achievement domain, it is not clear which spatial abilities are most relevant. Numerous spatial ability tests are available which tap into supposedly different processes, such as spatial information processing, mental rotation, spatial visualization, or manipulation of 2D and 3D objects (Uttal, Meadow, Tipton, Hand, Alden, & Warren, 2013).
However, several recent studies (Esipenko et al., 2018; Likhanov et al., 2018; Malanchini et al., 2019; Rimfeld et al., 2017) showed that spatial ability might have a unifactorial rather than multidimensional structure. For example, research has shown that the 10 spatial ability tests which form the King's Challenge test battery (Rimfeld et al., 2017) constitute a single factor in British and Russian samples, explaining 42 and 40 percent of overall variance in spatial ability measures, respectively (Likhanov et al., 2018; Rimfeld et al., 2017). Interestingly, in a Chinese sample assessed with the same battery, a two-factorial structure of spatial ability emerged (explaining 40% of the total variance), with Cross-sections and Mechanical Reasoning forming a separate factor. Further research is needed to identify the sources of these differences across the samples. The unifactorial structure of spatial ability was further demonstrated in another study that examined 16 measures of spatial ability in a UK sample (Malanchini et al., 2019). In this study, three factors emerged: navigation, object manipulation, and visualization; these in turn loaded strongly on a general factor of spatial ability. The unifactorial structure found in the UK and Russian samples suggests that, at least in these populations, a smaller number of tests can be used for rapid assessment of spatial ability.
The main purpose of the current study was to identify the most suitable spatial ability tests for creating a short online battery for educational assessment and talent identification. To this end, we investigated the psychometric properties of 10 spatial ability tests, as well as performance on these tests, in three adolescent samples selected for high achievement in science, arts, or sports. Comparison between these areas of expertise may provide additional insight into the role of spatial ability in these areas.
As the study was largely exploratory, we investigated the following research questions rather than testing specific hypotheses:
Research question 1: What are the best performing spatial ability tests in terms of psychometric properties?
Research question 2: What is the relationship between spatial ability and the three areas of expertise: Science, Sports, and Arts?
Research question 3: Does the previously shown unifactorial structure of spatial ability replicate in these expert samples?

Participants
The study included 1470 adolescents, who were recruited at the Sirius educational center in Russia (645 males, 468 females, and 357 participants who did not provide information on gender). The ages of the participants ranged from 13 to 17 years (M = 14.78, SD = 1.20). Sirius is an educational center which provides intensive four-week educational programs for schoolchildren who have demonstrated high achievement in Science, Arts, or Sports. Adolescents from all regions of Russia are invited to apply for participation in these educational programs. Participation, as well as travel and other expenses, is free for participants. The socio-economic status (SES) of the participants was not measured. However, the participants likely represented a wide range of SES backgrounds, since the program application is open to everyone, participants come from all Russian geographic regions, and participation is fully funded.
We invited high-achievers to participate in one of the three tracks, selected on the basis of the following criteria:
- Science (339 males, 208 females): high school achievement, such as winning in a subject Olympiad (maths, chemistry, physics, informatics, IT, biology, etc.), or excellent performance in a scientific project;
- Arts (50 males, 198 females): winning in different competitions and demonstrating high achievement in painting, sculpture, choreography, literature, or music;
- Sports (220 males, 55 females): participation and winning in high-rank sport competitions (hockey, chess, and figure skating).
Due to the limited sample size, we were not able to analyze differences within the tracks (e.g., math vs. chemistry; sculpture vs. choreography; or chess vs. hockey). We plan to explore those differences once the sample size needed for such research is achieved.

Procedure
The study was approved by the Ethical Committee for Interdisciplinary Research. Parents or legal guardians of participants provided written informed consent. Additionally, verbal consent was obtained from the participants before the study. The testing took place in the regular classrooms of the educational center, which are quite similar to each other.

Measures
King's Challenge battery. Participants were presented with a gamified online battery called the "King's Challenge" (KC), which had a test-retest reliability of r = 0.65 on average for the 10 spatial tests (Rimfeld et al., 2017); the battery was adapted for administration in Russian. The battery consists of 10 tests (see Table 1) and is gamified, with a general theme of building a castle and defending it against enemies. When they finished the battery, participants received feedback on their performance.
We used the total of all correct items to score each test for use in further analysis. A total score for all 10 tests was computed by summing up the scores for each (KC Total), following the procedure described by Rimfeld and colleagues (2017).
Non-verbal intelligence. Non-verbal intelligence was measured by a shortened version of the Raven's progressive matrices test (Raven, Raven, & Court, 1998). The test was modified to include six items (odd-numbered only) from each of the C, D, and E series, and three items from the F series (the A and B series were excluded). A discontinuation rule was applied in order to reduce the duration of the test: a series was terminated after three incorrect responses, and the test automatically progressed to the next series (in the F series, the test terminated immediately). The percentage of all correct responses out of the total number of 21 items was used for analysis.
Academic achievement. We used self-reported school Year grades for Math (Year grade Math) and the Russian Language (Year grade Rus). These grades are awarded by teachers to assess a student's performance for the whole school year in a respective subject (based on performance across the year). The grading system is 1 to 5, where 1 = "terrible/fail"; 2 = "bad/fail"; 3 = "satisfactory"; 4 = "good"; and 5 = "excellent". A 1 is practically never given, and a 2 is given only rarely (see Likhanov et al., 2020, for a discussion of the limitations of this grading system). In our sample, we had a restricted range of Year grades, with no 1 and 2 grades, since students who received these marks are unlikely to be invited to Sirius. The data for Year grades was available for 1109 participants.
We also collected self-reported grades for the State Final Assessment, a standardized exam hereafter referred to as the Exam. This test, taken at the end of 9th grade (15-16 years of age), is a measurement of students' performance that serves as a major educational assessment tool. In the current study, only scores for the Math (Exam Math) and Russian language (Exam Rus) exams were used. Exam marks range from 1 to 5. No participants in our study had a 1 or 2 on this exam. The data for Exam results was available for only 306 participants, since not all study participants were of the age to undergo this exam at the time of data collection.
Note: Example items for each test are provided in the Supplementary Materials at the conclusion of this article. The figures included there are referenced with the S prefix in the text. Detailed information on the battery can be found in Rimfeld et al., 2017.

Spatial test selection criteria
In order to select the most informative spatial tests for educational assessment and talent search, we focused on six characteristics:
1. Absence of floor and ceiling effects: clustering of participants' scores towards the worst or best possible scores (reflecting the unsuitability of the test difficulty level for the sample);
2. Differentiating power: the ability of the test to differentiate between Science, Arts, and Sports tracks in terms of average performance and distribution;
3. Low redundancy: this criterion allowed us to exclude tests which demonstrated very high correlations (above .7) with other tests in the battery;
4. Specificity: identifying tests that had small factor loadings on the latent "spatial ability" factor and/or loading on an additional factor, potentially suggesting specificity;
5. High reliability: having sufficiently high (.8) internal consistency;
6. High external validity: having common variance with non-verbal intelligence and educational achievement measures.
To check for floor and ceiling effects, we examined descriptive statistics, the shapes of distributions, and percentages of the highest and lowest values in each test. Distribution shapes also provided information on track differences. Differentiating power was further assessed with a series of ANOVAs. Factor structure was investigated by Principal Component Analysis (PCA). We also explored intercorrelations among all spatial measures to identify redundant tests indicated by strong bivariate correlations. Internal consistency was measured by the split-half reliability test, which randomly divides the test items into halves several times and compares the correlations between the two halves. External validity was assessed by correlating SA test scores with measures of non-verbal intelligence and academic achievement in Math and the Russian language.
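The split-half procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' R code: the function names, the number of random splits, and the choice to report the mean and standard deviation of the half-test correlations are all hypothetical, and no Spearman-Brown correction is applied because the text does not mention one.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_reliability(responses, n_splits=100, seed=0):
    """Mean and SD of half-test correlations over random item splits.

    responses: one list of 0/1 item scores per participant.
    n_splits and seed are illustrative defaults (assumptions).
    """
    rng = random.Random(seed)
    n_items = len(responses[0])
    items = list(range(n_items))
    correlations = []
    for _ in range(n_splits):
        rng.shuffle(items)  # new random assignment of items to halves
        half_a, half_b = items[: n_items // 2], items[n_items // 2:]
        score_a = [sum(row[i] for i in half_a) for row in responses]
        score_b = [sum(row[i] for i in half_b) for row in responses]
        correlations.append(pearson_r(score_a, score_b))
    return statistics.fmean(correlations), statistics.stdev(correlations)
```

Averaging over many random splits reduces the dependence of the estimate on any single, possibly unlucky, division of the items.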
Outliers were not deleted from the dataset, as we expect a significant proportion of children in this sample to demonstrate high performance in SA. For example, some studies showed that adolescents selected for math ability score higher than the third quartile of distribution in SA tests (see Benbow, 1992; Lubinski & Dawis, 1992, for discussion), which is usually recognized as a threshold for outliers (Tukey, 1977). Similarly, some participants from non-academic tracks might show particularly low scores since they were not selected for the program based on academic achievement, or due to their investment of effort in sport or music training. For this reason, low outliers were also kept in the data set. The percentage of outliers ranged from 0.5 to 8.6% of the sample. Data on the number of outliers are presented in Table S10 (see Supplemental Materials).
Most of the analysis was done in SPSS 22.0. R 3.1 was used to clean the data, to calculate split-half reliability analysis and to draw correlation heatmaps.
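As an illustration of how a percentage of outliers per test might be obtained, the sketch below applies the conventional Tukey fences (1.5 times the interquartile range beyond the first and third quartiles). The exact rule used for Table S10 is not stated in the text beyond the reference to Tukey (1977), so the multiplier `k`, the quartile method, and the function name are assumptions.

```python
import statistics


def tukey_outlier_share(scores, k=1.5):
    """Percentage of scores outside the Tukey fences.

    k = 1.5 is the conventional multiplier (an assumption here);
    quartiles use the default method of statistics.quantiles.
    """
    q1, _, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    flagged = [s for s in scores if s < low or s > high]
    return 100.0 * len(flagged) / len(scores)
```

Flagging values in this way only describes the distribution; in line with the reasoning above, flagged cases would be retained rather than removed.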

Data Analysis
The main purpose of the current study was to identify the most suitable spatial ability tests for creation of a short online battery for educational assessment and talent identification. Specifically, we examined six test characteristics as described in the Method section. Descriptive statistics for the whole sample and for different tracks separately are presented in Tables 2 and 3. Figure S1 (see Supplemental Materials) presents distributions for all tests for each track. The amount of missing data differed across measures: for spatial ability measurements, the missing data ranged from 52 to 264, as some participants did not complete the whole battery; for Year grades, the missing data ranged from 359 to 402, as these participants did not report their grades. In addition, as explained above, the data for Exams was available only for the older subsample which had completed the Exam. In most analyses reported in this paper, we used the data for the maximum number of participants which was available for each measure.
Table 2. Descriptive statistics for the whole sample: number of correct responses in spatial ability measures, exam and year grades, and non-verbal intelligence

Test (number of items) N Mean (SD) Min Max Skewness
Cross-sections (15)
Note. Total = total score for King's Challenge battery; the number of items in each test is presented in brackets; * Raven's score is calculated by dividing the number of correct answers by the total number of items; ^ The N for Exam was low because most of the study participants had not reached the age when this Exam is taken.
Note. The number of items (possible range) is shown in brackets next to each test name with the name of the subtest. KC Total = total score for King's Challenge battery; the number of items in each test is presented in the brackets; * Raven's score is calculated by dividing the number of correct answers by the total number of items; ^ Total score for 2D and 3D drawing tasks had decimals, as a score for an individual trial in both tests ranged from 0 to 1, reflecting the number of correct lines drawn in the time given for this trial.
Absence of floor and ceiling effects. Mechanical reasoning and Mazes demonstrated normal distribution, both across and within tracks. For Shape rotation, Paper folding, and Pattern assembly, the scores were negatively skewed for the Science track and positively skewed for the Sports track. Shape rotation, Paper folding, and Cross-sections tests demonstrated bimodal distributions for the whole sample. The ceiling effect for the whole sample was observed for the 2D-drawing and Elithorn mazes tests: in the 2D-drawing test, 43% of participants had scores of 4 or 5 (out of 5); in the Elithorn mazes test, 53% of participants had scores from 8 to 10 (out of 10). The floor effect was present in 3D-drawing and Perspective-taking tests: for the 3D-drawing test, 46.9% of participants had scores of 2 or lower (out of 7), and for the Perspective-taking test, 54% of participants had scores of 3 or lower (out of 15).
For further investigation of the floor and ceiling effects, we estimated the difficulty of each test by calculating the percentages of correct responses (see Table S1). For the whole sample, the Elithorn mazes and 2D-drawing were the easiest tests in the battery (77.7% and 68% of responses correct, respectively), whereas Perspective-taking was the most difficult one (28.2% of responses correct).
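The two descriptive checks used here, the share of participants scoring near the minimum or maximum and the percentage of correct responses per test, amount to simple counting. The sketch below shows one way to compute them; the function names and the toy scores are hypothetical and are not taken from the study data.

```python
def share_in_range(scores, low, high):
    """Percentage of participants whose score lies in [low, high]."""
    hits = sum(1 for s in scores if low <= s <= high)
    return 100.0 * hits / len(scores)


def percent_correct(scores, n_items):
    """Test difficulty as the percentage of all responses that are correct."""
    return 100.0 * sum(scores) / (n_items * len(scores))


# Toy scores on a hypothetical 5-item test (not study data)
toy_scores = [5, 4, 4, 1, 0, 3, 2, 5, 4, 2]
ceiling_share = share_in_range(toy_scores, 4, 5)   # share scoring 4 or 5
difficulty = percent_correct(toy_scores, n_items=5)
```

A large share at the top of the range points to a ceiling effect, and a large share at the bottom to a floor effect, for the sample in question.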
Differentiating power. We used ANOVA to examine potential differences among the Science, Arts, and Sports tracks. As described in the Method section, gender distribution across tracks was uneven. Previous studies that employed the same SA battery showed moderate gender differences in a British sample of young adults (Toivainen et al., 2018) and samples of Russian (Esipenko et al., 2018) and Chinese students (Likhanov et al., 2018). We examined gender effects in 11 one-way ANOVAs (10 tests and the total score) that showed male advantage for three tests, as well as a total SA score, and female advantage for two tests. All effects were negligible to modest (between .004 and .05; see Table S2 for details). Gender was regressed out in all further analyses. Thereafter, these standardized residuals were used in one-way ANOVAs to compare educational tracks (Science, Arts, and Sports). Homogeneity of variance was assessed by Levene's test (Levene, 1960). Welch's ANOVA was used to account for the heterogeneity of variance in some tests (Field, 2013). Variance heterogeneity among tracks was found for all tests (p ≤ 0.01), with the exception of Mechanical reasoning (p = 0.25) and Shape rotation (p = 0.13).
Overall, the ANOVAs showed significant average differences across the three tracks in every spatial measure and the total score, with effect sizes (η²) ranging from .13 to .65. The results of Welch's F-tests, p-values, and η² are presented in Table S3. Due to non-normal distribution within tracks in all tests, with the exception of Mechanical reasoning and Mazes, we conducted non-parametric tests to confirm the results of the ANOVA. The Kruskal-Wallis H test confirmed significant differences between tracks in all spatial tests and total scores (χ²(3, N = 1070) = [133.1 - 423.5]; p < .01). Means for all SA tests according to track are presented in Figure 1. Post-hoc analyses showed that each track significantly differed from each other track in each test (p < .05 for all comparisons). The Science track had the highest scores and the Sports track had the lowest.
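The two computational steps in this comparison, removing gender from the scores and expressing track differences as η², can be sketched as follows. This is an illustrative simplification: it uses a single dummy-coded predictor and the classical η² (between-group sum of squares over total sum of squares), not the Welch correction reported in Table S3, and all function names are assumptions.

```python
import statistics


def standardized_residuals(y, x):
    """Residuals of y after a simple linear regression on x, scaled to SD = 1.

    x is assumed to be a dummy-coded predictor such as gender (0/1).
    """
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    resid = [b - (my + slope * (a - mx)) for a, b in zip(x, y)]
    sd = statistics.stdev(resid)
    return [r / sd for r in resid]


def eta_squared(groups):
    """Classical eta squared: SS_between / SS_total for a one-way design."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.fmean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total
```

In such a pipeline, the residualized scores would be split by track and passed to `eta_squared` as three lists, one per track.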
Significant differences across the tracks were also found for non-verbal intelligence (F(2, 980) = 19.42; p < .01; η² = .31), with means of .83 (SD = .12), .73 (SD = .15), and .60 (SD = .18) for the Science, Arts, and Sports tracks, respectively.
Low redundancy. All pairwise correlations were significant and positive, ranging from r = .34 to r = .85 (see Tables S4 for within-track correlations). The data showed the highest correlations for the 3D-drawing, 2D-drawing, and Paper folding tests (> .67), which suggests that having all of them in one battery is unnecessary. Elithorn mazes and Mazes tests showed the lowest correlations with other spatial ability tests within the Arts track and the whole sample.
Specificity. We performed Principal Component Analysis (PCA) on the raw data (sum of the correct responses for each spatial test) for the whole sample and individual tracks. To ensure that the data was suitable for factor analysis, we applied the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and Bartlett's test of sphericity for both the whole sample and each track separately (see Table S5). The results indicated that the data was suitable for factor analysis (Hair et al., 1998). For the whole sample, the PCA scree plot (see Figure S2) and the eigenvalues suggested single factor extraction (explaining 56.48% of variance; see Table 5). All tests showed high loadings on this factor (.58 - .85). For the Science and Sports tracks, the factor structure was also unifactorial: a single factor explained 45.76% and 38.74% of variance, respectively. For the Arts track, two factors explained 50.41% of variance: factor 1 = 39.68%; and factor 2 = 10.79%. Factor 1 included all tests except the Elithorn mazes and Mazes, which formed factor 2. These findings indicate that one test from a battery would be able to assess the underlying spatial ability factor to some degree. Factor loadings and eigenvalues for the whole sample and each track separately are shown in Table 5.
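To make the "variance explained by a single factor" figure concrete, the sketch below extracts the first principal component of a correlation matrix by power iteration and reports its share of total variance and the component loadings. It is a teaching example under assumptions: the analysis in the paper was run in SPSS, and the function names, the iteration settings, and the use of a correlation matrix as input are illustrative choices only.

```python
def first_component(corr, n_iter=1000, tol=1e-12):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix.

    Uses power iteration from a vector of equal weights, which is a
    reasonable start when all correlations are positive (as in the
    matrices described above).
    """
    n = len(corr)
    vec = [1.0 / n ** 0.5] * n
    eigenvalue = 0.0
    for _ in range(n_iter):
        prod = [sum(corr[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in prod) ** 0.5
        vec = [v / norm for v in prod]
        # Rayleigh quotient of the normalised vector
        new_eigenvalue = sum(
            vec[i] * sum(corr[i][j] * vec[j] for j in range(n))
            for i in range(n)
        )
        converged = abs(new_eigenvalue - eigenvalue) < tol
        eigenvalue = new_eigenvalue
        if converged:
            break
    return eigenvalue, vec


def summarize_first_component(corr):
    """Proportion of variance explained and loadings of the first component."""
    eigenvalue, vec = first_component(corr)
    explained = eigenvalue / len(corr)  # trace of a correlation matrix = n tests
    loadings = [eigenvalue ** 0.5 * v for v in vec]
    return explained, loadings
```

For example, four tests that all intercorrelate at .5 yield a first eigenvalue of 2.5, so the single component accounts for 62.5% of the variance.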
Reliability. Split-half reliabilities for the whole sample and separate tracks are shown in Table S6. Split-half reliability varied from weak to strong across the tests in the whole sample (r = .27 - .95). High reliabilities (> .8) were shown for Cross-sections, 2D drawing, Pattern assembly, Paper folding, 3D drawing, Shape rotation, and Perspective-taking. Moderate reliabilities (> .65) were shown for Mechanical reasoning and Mazes. Low reliability (.27) was shown for Elithorn mazes. The pattern of reliability was similar for all tracks.
External validity. Table 6 presents the correlations between the spatial ability tests, Raven's progressive matrices, and academic achievement for the full sample (see Tables S7 - S9 for correlations within tracks). All tests showed significant positive weak to strong correlations with non-verbal intelligence: r(1325) = [.38 - .62], p ≤ .01 for the whole sample and within tracks.
For the whole sample, SA was correlated with the Year grades for both Mathematics (r(1056) = [.24 - .49], p ≤ .01) and the Russian language (r(1107) = [.12 - .30], p ≤ .01). Fisher's r-to-z transformation showed that correlations were higher for Math than for Russian (z = [3.9 - 5.88], p ≤ .01), with the exception of Elithorn mazes. The pattern of correlations between the students' Year grades and SA tests was slightly different within tracks (see Table S10). On the Science track, there were significant weak to moderate correlations between SA tests and Year grade for Mathematics (r(547) = [.12 - .30], p ≤ .01), but no correlations between spatial tests and the Year grade for the Russian language. On the Arts and Sports tracks, there were consistent significant correlations between the Year grades in Math and SA, and some between Year grades in Russian and SA (Fisher's z was non-significant).
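A comparison of two correlations via Fisher's r-to-z transformation can be sketched as below. The text does not specify which variant of the test was used; this sketch uses the textbook formula for two independent correlations, which is a simplification here because the Math and Russian correlations come from overlapping participants. The function name is hypothetical, and the sample sizes in the example are inferred from the reported degrees of freedom (n = df + 2).

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """z statistic and two-sided p-value for the difference between two
    correlations, treating them as independent (an assumption)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher r-to-z transformation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p


# Illustrative call using the upper bounds of the reported ranges
z_stat, p_value = compare_correlations(0.49, 1058, 0.30, 1109)
```

A test for dependent correlations would additionally need the correlation between the two grade measures, which is why the independent-samples form is shown only as a first approximation.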
Tables S10 and S11 present the results for correlations between SA and the Exam. In the whole sample, the Math Exam showed weak to moderate correlations with SA (r(304) = [.20 - .34], p ≤ .05); the Russian Exam was only weakly correlated with SA (r(304) = [.12 - .16], p ≤ .05). Within tracks, only a few correlations between SA and Exam reached significance.

Tests selected for inclusion in the Online Short Spatial Ability Battery (OSSAB)
Four of the tests matched the criteria for selection, including the predicted pattern of moderate correlations with nonverbal intelligence and mathematics achievement (e.g., Tosto et al., 2014). Below we describe the selected tests:
1. Paper Folding is a widely used measure of spatial visualization (Carroll, 1993), which has previously been recommended for talent identification (Hegarty & Waller, 2005; Linn & Petersen, 1985; Uttal et al., 2012). In the present study, Paper Folding appeared very similar to 2D and 3D drawing tests in correlational patterns, discriminant validity, factor loadings, and reliability. However, 2D and 3D drawing tests were excluded, as they showed either ceiling or floor effects;
2. Shape Rotation taps into a different dimension of spatial ability: mental rotation (Shepard & Metzler, 1971). This test was selected as it matched all established criteria, including high reliability and different distributions for the different tracks;
3. Mechanical Reasoning taps into a construct of Mechanical Aptitude: the ability to understand and apply mechanical concepts and principles to solve problems (Wiesen, 2015); it is recognized as important in educational tracking and career planning (Muchinsky, 1993). We selected the Mechanical Reasoning test, which showed better results than Cross-sections and Elithorn mazes in terms of normally distributed scores for all three tracks, as well as significant track differences;
4. Pattern Assembly measures spatial relations, another important aspect of spatial ability (Carroll, 1993). This test showed the same pattern of distribution across tracks (along with Shape Rotation and Paper Folding), as well as high reliability, high factor loadings, and good correlations with other tests. By contrast, Mazes had low correlations with other tests and low discriminant validity; and Perspective-taking had high reliability, factor loadings, and correlations with other tests, but showed a strong floor effect.

Discussion
The purpose of the present study was to investigate the psychometric properties and factor structure of 10 spatial ability tests in order to create a short battery suitable for educational assessment and talent search. We collected data using an existing extensive spatial ability battery (King's Challenge; Rimfeld et al., 2017) in a sample of schoolchildren who had demonstrated high achievement in Science, Arts, or Sports. Based on our analysis, four tests were identified to be included in an Online Short Spatial Ability Battery (OSSAB). The following four best-performing tests were selected: Paper Folding, Shape Rotation, Mechanical Reasoning, and Pattern Assembly. All selected tests are available at https://github.com/fmhoeger/OSSAB. We analyzed our data to demonstrate the utility of the OSSAB for educational purposes. In particular, we ran the analysis by splitting the sample into three educational tracks (Science, Arts, and Sports). The analysis showed significant differences between tracks, with η² ranging from .32 to .67. For example, the Science track showed the highest results in all four tests. We also compared the results of the Science track with previous results and found higher average performance in the Science track than that of unselected university students from China and Russia (Esipenko et al., 2018; Likhanov et al., 2018) and of an unselected population of young adults from the UK (Rimfeld et al., 2017). Our result was also consistent with repeatedly found correlations between math and spatial ability (.43; Tosto et al., 2014), and between intelligence and academic achievement (.60 - .96; Bouchard & Fox, 1984; Deary, Strand, Smith, & Fernandes, 2007; Kemp, 1955; Wiseman, Meade, & Parks, 1966). Considering that SA was not part of the admission criteria for the Science track, the results suggest that SA might be a useful marker for high STEM performance.
These results provide further support for adding SA tests to verbal and math tests in order to establish patterns of strengths and weaknesses that can be predictive of future achievement in different domains (Shea, Lubinski, & Benbow, 2001; Webb, Lubinski, & Benbow, 2007). Moreover, talent search programs that focus mostly on verbal and math ability may overlook people with high SA only, which may lead to disengagement and behavioral problems in these young people (Lakin & Wai, 2020). These individuals will benefit from early identification of their high SA, and from personalized educational programs that capitalize on their strengths, including such activities as electronics, robotics, and mechanics.
For the Sports track, a positive skew was shown in Shape rotation, Paper folding, and Pattern assembly. It is possible that the relatively low performance of the Sports track on SA and other cognitive and academic achievement measurements is the result of these students' extreme investment of effort in sports training (see Likhanov, 2021, in preparation, for discussion). It is also common for athletes to disengage from traditional academic studies (Adler & Adler, 1985) and fall behind academically (e.g., due to attending training camps). SA training that involves more enjoyable activities, for example, using computer games and VR or AR (augmented reality) (Uttal et al., 2014; Papakostas et al., 2021), might be beneficial for their academic performance.
It is also possible that the battery used in this study did not tap into the ability of athletes to process visuo-spatial information in a natural environment, such as attentional processes or long-term working memory, which was shown in some studies to be highly developed in professional athletes, including hockey players (Belling et al., 2015; Mann et al., 2007; Voss et al., 2010). The tests in this study measured mostly small-scale SA, i.e., the ability to mentally represent and transform two- and three-dimensional images that can typically be apprehended from a single vantage point (Likhanov et al., 2018; Wang and Carr, 2014). Further research is needed that includes both small- and large-scale spatial ability tests.
For the Arts track, the average performance fell somewhere in between the Science and Sports tracks. This track is heterogeneous, but the sample size was not large enough to investigate spatial ability in separate sub-tracks (e.g., fine arts vs. music). Therefore, in this study, the Arts track can be considered unselected in terms of academic achievement.
Cross-track differences also emerged in the structure of SA. Results from the factor analysis for the whole sample and for the Science and Sports tracks replicated the previous findings of the unifactorial structure of spatial ability (Esipenko et al., 2018; Likhanov et al., 2018; Rimfeld et al., 2017). However, for the Arts track, a two-factorial structure emerged (Elithorn mazes and Mazes tests formed the second factor).
A number of speculative explanations for this can be proposed. The Arts track included high achievers in music (20%), literature (40%), and fine art (30%). The second factor may reflect an advanced ability of the fine art students to process visual information in two-dimensional space, as these two tests are hypothesized to measure an ability for 2D image scanning (Poltrock & Brown, 1984). Alternatively, a number of methodological issues may also have led to the second factor on the Arts track. The two tests showed lower correlations with other spatial ability measures (lower than .26) for the Arts track, which could have stemmed from the smaller sample size for this track (though sufficient, e.g., according to Comrey and Lee, 1992) and lower reliability of the two tests.

Conclusion
The Online Short Spatial Ability Battery (OSSAB) can be used for talent identification, educational assessment, and support. Future research is needed to evaluate the use of this battery with other specific samples and unselected populations.

Limitations
Our study had a number of limitations. Firstly, sample sizes differed among sex and track groups, precluding fine-grained investigation of these effects. Secondly, the study had only limited access to students' academic achievement: the majority of the sample had not yet taken the state exam; and the Year grades only provided a very crude estimate of achievement as they range from 2 to 5, with 2 absent from this sample. Thirdly, as mentioned above, large-scale spatial ability was not measured in the current study. Further research is needed to evaluate the relative strengths and weaknesses in small- and large-scale spatial abilities for different tracks. Fourthly, there were some differences in reliability across measures. Moreover, some tests could be more enjoyable. Future research needs to explore whether and how enjoyment may be related to the test validity.

Ethics Statement
Our study was approved by the Ethics Committee for Interdisciplinary Research of Tomsk State University, approval № 16012018-5.

Informed Consent from the Participants' Legal Guardians
Written informed consent to participate in this study was provided by the participants' parents, legal guardians, or next of kin. Oral consent was also obtained from each minor at the time of testing.

Author Contributions
A.B. and M.L. planned the study and data collection. M.L. significantly contributed to the text of the manuscript. A.B., A.Z. and E.S. did the data collection and wrote the first draft. E.B. did the statistical analysis. T.T. made a contribution to the text of the paper. Y.K. conceived of the idea and supervised work on the study and reviewed the paper. All authors discussed the results and contributed to the final manuscript.

Conflict of Interest
The authors declare no conflict of interest.
Note: * = p ≤ 0.05; ** = p ≤ 0.001; S-HR = split-half reliability; SD = standard deviation for split-half reliability.